Abstract

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, and co-citations, as well as information retrieval, collaborative filtering, and sparse matrix reordering. Virtually all popular methods for the analysis of such matrices (e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining) require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and “support.” Choosing suitable values for such parameters is a challenging problem.

Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.
Figure 1: Searching for cross-associations: Starting with the original matrix (plot (a)), our algorithm successively increases the number of groups. At each stage, starting with the current arrangement into groups, rows and columns are rearranged to improve the code cost. [Panels: (a) Original matrix; (b) Iteration pair 1; (c) Iteration pair 2.]

1  Introduction  -  Motivation 

Large, sparse binary matrices arise in many applications, under several guises. Consequently, because of its importance and prevalence, the problem of discovering structure in binary matrices has been widely studied in several domains: (1) Market basket analysis and frequent itemsets: The rows of the matrix represent customers (or transactions) and the columns represent products. Entry (i, j) of the matrix is 1 if customer i purchased product j and 0 otherwise. (2) Information retrieval: Rows correspond to documents, columns to words, and an entry in the matrix represents whether a certain word is present in a document or not. (3) Graph partitioning and community detection: Rows and columns correspond to source and target objects and matrix entries represent links from a source to a destination. (4) Collaborative filtering, microarray analysis, and numerous other applications; in fact, any setting that has a many-to-many relationship (in database terminology) in which we need to find patterns.

We  ideally  want  a  method  that  discovers  structure  in  such  datasets  and  has  the  following  main  properties: 

(P1) It is fully automatic; in particular, we want a principled and intuitive problem formulation, such that the user does not need to set any parameters.

(P2)  It  simultaneously  discovers  both  row  and  column  groups. 

(P3)  It  scales  up  for  large  matrices. 


Cross-association  and  Our  Contributions  The  fundamental  question  in  mining  large,  sparse  binary  matrices  is 
whether there is any underlying structure. In these cases, the labels (or, equivalently, the ordering) of the rows and columns are immaterial. The binary matrix contains information about associations between objects, irrespective of
their  labeling.  Intuitively,  we  seek  row  and  column  groupings  (equivalently,  labellings)  that  reveal  the  underlying 
structure.  We  can  group  rows,  based  on  some  notion  of  “similarity”  and  we  could  do  the  same  for  columns.  Better 
yet,  we  would  like  to  simultaneously  find  row  and  column  groups,  which  divide  the  matrix  into  rectangular  regions 
as  “similar”  or  “homogeneous”  as  possible.  These  intersections  of  row  and  column  groups,  or  cross-associations, 
succinctly  summarize  the  underlying  structure  of  object  associations.  The  corresponding  rectangular  regions  of 
varying  density  can  be  used  to  quickly  navigate  through  the  structure  of  the  matrix. 

In  short,  we  would  like  a  method  that  will  take  as  input  a  matrix  like  in  Figure  1(a),  and  will  quickly  and 
automatically  (i)  determine  a  good  number  of  row  groups  k  and  column  groups  l  and  (ii)  re-order  the  rows  and 
columns,  to  reveal  the  hidden  structure  of  the  matrix,  like  in  Figure  1(e). 

We  propose  a  method  that  has  precisely  the  above  properties:  it  requires  no  “magic  numbers,”  discovers  row  and 
column  groups  simultaneously  (see  Figure  1)  and  scales  linearly  with  the  problem  size.  We  introduce  a  novel 
approach and propose a general, intuitive model founded on compression and information-theoretic principles. In
particular,  unlike  existing  methods,  we  employ  lossless  compression  and  always  operate  at  a  zero-distortion  level. 
Thus,  we  can  use  the  MDL  principle  to  automatically  select  the  number  of  row  and  column  groups.  We  provide  an 
integrated  framework  to  automatically  find  cross-associations.  Also,  our  method  is  easily  extensible  to  matrices 
with  categorical  values. 

In  Section  2,  we  survey  the  related  work.  In  Section  3,  we  formulate  our  data  description  model  starting  from 
first principles. Based on this, in Section 4 we develop an efficient, parameter-free algorithm to discover cross-associations. In Section 5 we evaluate cross-associations, demonstrating good results on several real and synthetic datasets. Finally, we conclude in Section 6.

2  Survey 

In general, there are numerous settings where we want to find patterns, correlations and rules. There are several time-tested tools for most of these tasks. Next, we discuss several of these approaches, dividing them broadly into application domains. However, with few exceptions, all require tuning and human intervention, thus failing on property (P1).

Clustering  We  discuss  work  in  the  “traditional”  clustering  setting  first.  By  that  we  mean  approaches  for  grouping 
along  the  row  dimension  only:  given  a  collection  of  n  points  in  m  dimensions,  find  “groupings”  of  the  n  points. 
This  setting  makes  sense  in  several  domains  (for  example,  if  the  m  dimensions  have  an  inherent  ordering),  but  it  is 
different  from  our  problem  setting. 

Also, most of the algorithms assume a user-given parameter. For example, the most popular approach, k-means clustering, requires k from the user. The problem of finding k is a difficult one and has attracted attention recently; for example, X-means [1] uses BIC to determine k. Another more recent approach is G-means [2], which assumes a mixture of Gaussians (often a reasonable assumption, but one which may not hold for binary matrices). Other interesting variants of k-means that improve clustering quality are k-harmonic means [3] (which still requires k) and spherical k-means (e.g., see [4]), which applies to binary data but still focuses on clustering along one dimension. Finally, there are many other recent clustering algorithms (CURE [5], BIRCH [6], Chameleon [7], [8]; see also [9]).

Several of the clustering methods might suffer from the dimensionality curse (like the ones that require a covariance matrix); others may not scale up for large datasets.

Information-theoretic co-clustering (ITCC) [10] is a recent algorithm for simultaneously clustering rows and columns of a normalized contingency table or a two-dimensional probability distribution. Cross-associations (CA) also simultaneously group rows and columns of a binary (or categorical) matrix and, on the surface, bear similarity to ITCC. However, the two approaches are quite different:

(1)  For  each  rectangular  intersection  of  a  row  cluster  with  a  column  cluster,  CA  constructs  a  lossless  code, 
whereas  ITCC  constructs  a  lossy  code  that  can  be  thought  of  as  a  rank-one  matrix  approximation. 

(2) ITCC generates a progressively finer approximation of the original matrix. More specifically, as the number of row and column clusters is increased, the Kullback-Leibler divergence (or, KL-divergence) between the original matrix and its lossy approximation tends to zero. In contrast, regardless of the number of clusters, CA always losslessly transmits the entire matrix. In other words, as the number of row and column clusters is increased, ITCC tries to sweep an underlying rate-distortion curve, where the rate depends upon the number of row and column clusters and distortion is the KL-divergence between the original matrix and its lossy approximation. In comparison, CA always operates at zero distortion.

(3)  While  both  ITCC  and  CA  use  alternating  minimization  techniques,  ITCC  minimizes  the  KL-divergence 
between  the  original  matrix  and  its  lossy  approximation,  while  CA  minimizes  the  resulting  codelength  for  the 
original matrix.

(4)  As  our  key  contribution,  in  CA,  we  use  the  MDL  principle  to  automatically  select  the  number  of  row  and 
column  clusters.  While  MDL  is  well  known  for  lossless  coding  which  is  the  domain  of  CA,  no  MDL-like  principle 
is  yet  known  for  lossy  coding;  for  a  very  recent  proposal  towards  this  direction,  see  [11].  As  a  result,  selecting  the 
number  of  row  and  column  clusters  in  ITCC  is  still  an  art.  Note  that  ITCC  is  similar  in  spirit  to  the  Information 
Bottleneck  formulation  [12]. 

Thus,  to  the  best  of  our  knowledge,  our  method  is  the  first  to  study  explicitly  the  problem  of  parameter-free, 
joint  clustering  of  large  binary  matrices. 

Market-basket  analysis  /  frequent  itemsets  Frequent  itemset  mining  brought  a  revolution  [13]  with  a  lot  of 
follow-up  work  [9,  14].  However,  they  require  the  user  to  specify  a  “support.”  The  work  on  “interestingness”  is 
related  [15],  but  still  does  not  answer  the  question  of  “support.” 

Information retrieval and LSI The pioneering method of LSI [16] uses SVD on the term-document matrix. Again, the number k of eigenvectors/concepts to keep is up to the user ([16] empirically suggest about 200 concepts). Additional matrix decompositions include the Semi-Discrete Decomposition (SDD) [17], PLSA [18], the clever use of random projections to accelerate SVD [19], and many more. However, they all fail on property (P1).

Graph  partitioning  The  prevailing  methods  are  METIS  [20]  and  spectral  partitioning  [21].  These  approaches 
have  attracted  a  lot  of  interest  and  attention;  however,  both  need  the  user  to  specify  k,  that  is,  the  number  of  pieces 
to  break  the  graph  into.  Moreover,  they  typically  also  require  a  measure  of  imbalance  between  the  two  pieces  of 
each  split. 

Other domains Related to graphs in several settings is the work on conjunctive clustering [22] (which requires density, i.e., “homogeneity,” and overlap parameters), as well as community detection [23], among many. Finally, there are several approaches to cluster micro-array data (e.g., [24]).

In conclusion, the above methods miss one or more of our prerequisites, typically (P1). Next, we present our method.


3  Cross-association  and  Compression 

Our goal is to find patterns in a large, binary matrix, with no user intervention, as shown in Figure 1. How should we decide the number of row and column groups ($k$ and $\ell$, respectively) along with the assignments of rows/columns to their “proper” groups?

We introduce a novel approach and propose a general, intuitive model founded on compression, and more specifically, on the MDL (Minimum Description Length) principle [25]. The idea is the following: the binary matrix represents associations between objects (corresponding to rows and columns). We want to somehow summarize these in cross-associations, i.e., homogeneous, rectangular regions of high and low densities. At the very extreme, we can have $m \times n$ “rectangles,” each really being an element of the original matrix, and having “density” of either 0 or 1. Then, each rectangle needs no further description. At the other extreme, we can have one rectangle, with a density in the range from 0 to 1. However, neither really is a summary of the data. So, the question is, how many rectangles should we have? The idea is that we penalize the number of rectangles, i.e., the complexity of the data description. We do this in a principled manner, based on a novel application of the MDL philosophy (where the costs are based on the number of bits required to transmit both the “summary” of the structure, as well as each rectangular region, given the structure).
Symbol | Definition
--- | ---
$D$ | Binary data matrix
$m, n$ | Dimensions of $D$ (rows, columns)
$k, \ell$ | Number of row and column groups
$k^*, \ell^*$ | Optimal number of groups
$(\Psi, \Phi)$ | Cross-association
$D_{i,j}$ | Cross-associate (submatrix)
$a_i, b_j$ | Dimensions of $D_{i,j}$
$n(D_{i,j})$ | Number of elements, $n(D_{i,j}) := a_i b_j$
$n_0(D_{i,j}), n_1(D_{i,j})$ | Number of 0, 1 elements in $D_{i,j}$
$P_{D_{i,j}}(0), P_{D_{i,j}}(1)$ | Densities of 0, 1 in $D_{i,j}$
$H(p)$ | Binary Shannon entropy function
$C(D_{i,j})$ | Code cost for $D_{i,j}$
$T(D; k, \ell, \Psi, \Phi)$ | Total cost for $D$

Table 1: Table of main symbols.


This  is  an  intuitive  and  very  general  model  of  the  data,  that  requires  no  parameters.  Our  model  allows  us  to 
find  good  cross-associations  automatically.  Next,  we  describe  the  theoretical  underpinnings  in  detail. 

3.1  Cross-association 

Let $D = [d_{i,j}]$ denote an $m \times n$ ($m, n \ge 1$) binary data matrix. Let us index the rows as $1, 2, \ldots, m$ and columns as $1, 2, \ldots, n$.

Let $k$ denote the desired number of disjoint row groups and let $\ell$ denote the desired number of disjoint column groups. Let us index the row groups by $1, 2, \ldots, k$ and the column groups by $1, 2, \ldots, \ell$. Let

$$\Psi : \{1, 2, \ldots, m\} \to \{1, 2, \ldots, k\}$$

$$\Phi : \{1, 2, \ldots, n\} \to \{1, 2, \ldots, \ell\}$$

denote the assignments of rows to row groups and columns to column groups, respectively. We refer to $(\Psi, \Phi)$ as a cross-association. To gain further intuition about a given cross-association, given row groups $\Psi$ and column groups $\Phi$, let us rearrange the underlying data matrix $D$ such that all rows corresponding to group 1 are listed first, followed by rows in group 2, and so on. Similarly, let us rearrange $D$ such that all columns corresponding to group 1 are listed first, followed by columns in group 2, and so on. Such a rearrangement, implicitly, sub-divides the matrix $D$ into smaller two-dimensional, rectangular blocks. We refer to each such sub-matrix as a cross-associate, and denote them as $D_{i,j}$, $i = 1, \ldots, k$ and $j = 1, \ldots, \ell$. Let the dimensions of $D_{i,j}$ be $(a_i, b_j)$.
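The rearrangement described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `rearrange` and `cross_associates` are hypothetical, and group labels are 0-based here, whereas the text indexes groups from 1.

```python
import numpy as np

def rearrange(D, row_labels, col_labels):
    """Reorder D so rows (columns) of group 0 come first, then group 1, and so on."""
    row_order = np.argsort(row_labels, kind="stable")  # keeps original order inside a group
    col_order = np.argsort(col_labels, kind="stable")
    return np.asarray(D)[np.ix_(row_order, col_order)]

def cross_associates(D, row_labels, col_labels, k, l):
    """Return a k x l nested list whose (i, j) entry is the sub-matrix D_{i,j}.

    row_labels[x] plays the role of Psi(x) and col_labels[y] the role of Phi(y).
    """
    D = np.asarray(D)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    blocks = []
    for i in range(k):
        rows = np.flatnonzero(row_labels == i)
        blocks.append([D[np.ix_(rows, np.flatnonzero(col_labels == j))]
                       for j in range(l)])
    return blocks

# A small interleaved matrix: even rows link to even columns, odd rows to odd columns.
D = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])
labels = [0, 1, 0, 1]
print(rearrange(D, labels, labels))
```

After reordering, the two groups appear as two all-ones blocks on the diagonal, and `cross_associates(D, labels, labels, 2, 2)[0][1]` is the all-zero 2 x 2 block between row group 0 and column group 1.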


3.2  A  Lossless  Code  for  a  Binary  Matrix 

With the intent of establishing a close connection between cross-association and compression, we first describe a lossless code for a binary matrix. There are several possible models and algorithms for encoding a binary matrix. With hindsight, we have simply chosen a code that allows us to build an efficient and analyzable cross-association algorithm. Throughout this paper, all logarithms are base 2 and all code lengths are in bits.

Let $A$ denote an $a \times b$ binary matrix. Define

$$
\begin{aligned}
n_1(A) &:= \text{number of nonzero entries in } A \\
n_0(A) &:= \text{number of zero entries in } A \\
n(A) &:= n_1(A) + n_0(A) = a \times b \\
P_A(i) &:= n_i(A)/n(A), \quad i = 0, 1.
\end{aligned}
$$

Intuitively, we model the matrix $A$ such that its elements are drawn in an i.i.d. fashion according to the distribution $P_A$. Given the knowledge of the matrix dimensions $(a, b)$ and the distribution $P_A$, we can encode $A$ as follows. Scan $A$ in a fixed, predetermined ordering. Whenever $i$, $i = 0, 1$, is encountered, it can be encoded using $-\log P_A(i)$ bits, on average. The total number of bits sent (this can also be achieved in practice using, e.g., arithmetic coding [26, 27, 28]) will be

$$C(A) := \sum_{i=0}^{1} n_i(A) \log\left(\frac{n(A)}{n_i(A)}\right) = n(A)\, H(P_A(0)), \tag{1}$$

where $H$ is the binary Shannon entropy function.

For example, consider the matrix

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

In this case, $n_1(A) = 4$, $n_0(A) = 12$, $n(A) = 16$, $P_A(1) = 1/4$, $P_A(0) = 3/4$. We can encode each 0 element using roughly $\log(4/3)$ bits and each 1 element using roughly $\log 4$ bits. The total code length for $A$ is: $4 \cdot \log 4 + 12 \cdot \log(4/3) = 16 \cdot H(1/4)$.
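The arithmetic of this example can be checked numerically. The following is a minimal sketch of Eq. 1, assuming the matrix is given as a list of 0/1 rows; the helper names `binary_entropy` and `code_cost` are illustrative and not taken from the paper.

```python
import math

def binary_entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(A):
    """Code cost C(A) of Eq. 1: sum over u in {0, 1} of n_u(A) * log2(n(A) / n_u(A))."""
    n = sum(len(row) for row in A)
    n1 = sum(sum(row) for row in A)
    n0 = n - n1
    cost = 0.0
    for count in (n0, n1):
        if count > 0:  # a symbol that never occurs contributes nothing
            cost += count * math.log2(n / count)
    return cost

A = [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]]

print(code_cost(A))                # about 12.98 bits
print(16 * binary_entropy(1 / 4))  # the same value, via n(A) * H(P_A)
```

Both expressions agree, which is the identity stated in Eq. 1. Note that 12.98 bits is less than the 16 bits needed to send the matrix entry by entry.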

3.3  Cross-association  and  Compression 

We now make precise the link between cross-association and compression. Let us suppose that we are interested in transmitting (or storing) the data matrix $D$ of size $m \times n$ ($m, n \ge 1$), and would like to do so as efficiently as possible. Let us also suppose that we are given a cross-association $(\Psi, \Phi)$ of $D$ into $k$ row groups and $\ell$ column groups, with none of them empty.

With these assumptions, we now describe a two-part code for the matrix $D$. The first part will be the description complexity involved in describing the cross-association $(\Psi, \Phi)$. The second part will be the actual code for the matrix, given the cross-association.

3.3.1  Description  Complexity 

The  description  complexity  in  transmitting  the  cross-association  shall  consist  of  the  following  terms: 

1. Send the matrix dimensions $m$ and $n$ using, e.g., $\log^*(m) + \log^*(n)$ bits, where $\log^*$ is the universal code length for integers (footnote 1). However, this term is independent of the cross-association. Hence, while useful for actual transmission of the data, it will not figure in our framework.

2. Send the row and column permutations using, e.g., $m\lceil \log m \rceil$ and $n\lceil \log n \rceil$ bits, respectively. This term is also independent of any given cross-association.

3. Send the number of groups $(k, \ell)$ using $\log^* k + \log^* \ell$ bits (or alternatively, using $\lceil \log m \rceil + \lceil \log n \rceil$ bits).

4. Send the number of rows in each row group and also the number of columns in each column group. Let us suppose that $a_1 \ge a_2 \ge \ldots \ge a_k \ge 1$ and $b_1 \ge b_2 \ge \ldots \ge b_\ell \ge 1$. Compute

$$\bar{a}_i := \left(\sum_{t=i}^{k} a_t\right) - k + i, \quad i = 1, \ldots, k - 1$$

$$\bar{b}_j := \left(\sum_{t=j}^{\ell} b_t\right) - \ell + j, \quad j = 1, \ldots, \ell - 1.$$

Now, the desired quantities can be sent using the following number of bits:

$$\sum_{i=1}^{k-1} \lceil \log \bar{a}_i \rceil + \sum_{j=1}^{\ell-1} \lceil \log \bar{b}_j \rceil.$$

5. For each cross-associate $D_{i,j}$, $i = 1, \ldots, k$ and $j = 1, \ldots, \ell$, send $n_1(D_{i,j})$, i.e., the number of ones in the matrix, using $\lceil \log(a_i b_j + 1) \rceil$ bits.

Footnote 1: It can be shown that $\log^*(x) \approx \log_2(x) + \log_2 \log_2(x) + \ldots$, where only the positive terms are retained and this is the optimal length, if we do not know the range of values for $x$ beforehand [29].

3.3.2  The  Code  for  the  Matrix 

Let us now suppose that the entire preamble specified above has been sent. We now transmit each of the actual cross-associates $D_{i,j}$, $i = 1, \ldots, k$ and $j = 1, \ldots, \ell$, using $C(D_{i,j})$ bits according to Eq. 1.

3.3.3  Putting  It  Together 

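We can now write the total code length for the matrix $D$, with respect to a given cross-association as:

$$
\begin{aligned}
T(D; k, \ell, \Psi, \Phi) := {} & \log^* k + \log^* \ell + \sum_{i=1}^{k-1} \lceil \log \bar{a}_i \rceil + \sum_{j=1}^{\ell-1} \lceil \log \bar{b}_j \rceil \\
& + \sum_{i=1}^{k} \sum_{j=1}^{\ell} \lceil \log(a_i b_j + 1) \rceil + \sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D_{i,j}),
\end{aligned} \tag{2}
$$

where we ignore the costs $\log^*(m) + \log^*(n)$ and $m\lceil \log m \rceil + n\lceil \log n \rceil$, since they do not depend upon the given cross-association.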
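To make Eq. 2 concrete, the sketch below evaluates the total cost for a given pair of label vectors. It rests on several assumptions that are not part of the paper: labels are 0-based, no group is empty, `log_star` keeps only the positive terms of the approximation in the footnote (without any additive constant), and all function names are hypothetical.

```python
import math
import numpy as np

def log_star(x):
    """Approximate universal code length: log2(x) + log2(log2(x)) + ..., positive terms only."""
    total, term = 0.0, math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def ceil_log2(x):
    """ceil(log2(x)) for an integer x >= 1, computed without floating point."""
    return (int(x) - 1).bit_length()

def code_cost(block):
    """C(D_ij) of Eq. 1 for a 0/1 NumPy array."""
    n, n1 = block.size, int(block.sum())
    return sum(c * math.log2(n / c) for c in (n1, n - n1) if c > 0)

def group_size_cost(sizes):
    """Bits for the group sizes: sum of ceil(log2(a_bar_i)) for i = 1 .. k-1."""
    a = sorted((int(s) for s in sizes), reverse=True)
    k = len(a)
    # a_bar_i = (a_i + ... + a_k) - k + i, with i 1-based as in the text
    return sum(ceil_log2(sum(a[i - 1:]) - k + i) for i in range(1, k))

def total_cost(D, row_labels, col_labels, k, l):
    """Total code length T(D; k, l, Psi, Phi) of Eq. 2, assuming no empty groups."""
    D = np.asarray(D)
    row_labels, col_labels = np.asarray(row_labels), np.asarray(col_labels)
    cost = log_star(k) + log_star(l)
    cost += group_size_cost(np.bincount(row_labels, minlength=k))
    cost += group_size_cost(np.bincount(col_labels, minlength=l))
    for i in range(k):
        for j in range(l):
            block = D[np.ix_(row_labels == i, col_labels == j)]
            cost += ceil_log2(block.size + 1) + code_cost(block)
    return cost

D = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(total_cost(D, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2))  # two groups per side
print(total_cost(D, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1))  # a single block
```

On this toy matrix the two-group description costs 18 bits and the single-block description 21 bits, so the extra description complexity of the finer grouping is outweighed by the cheaper (here zero-cost) homogeneous blocks.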

3.4  Problem  Formulation 

An optimal cross-association corresponds to the number of row groups $k^*$, the number of column groups $\ell^*$, and a cross-association $(\Psi^*, \Phi^*)$ such that the total resulting code length, namely, $T(D; k^*, \ell^*, \Psi^*, \Phi^*)$, is minimized. Typically, such problems are computationally hard. Hence, in this paper, we shall pursue feasible practical strategies. To determine the optimal cross-association, we must determine both the number of row and column groups and also a corresponding cross-association. We break this joint problem into two related components: (i) finding a good cross-association for a given number of row and column groups; and (ii) searching for the number of row and column groups. In Section 4.1 we describe an alternating minimization algorithm to find a locally optimal cross-association for a fixed number of row and column groups. In Section 4.2, we outline an effective heuristic strategy that searches over $k$ and $\ell$ to minimize the total code length $T$. This heuristic is integrated with the minimization algorithm.
Figure 2: Row and column shifting: Holding $k$ and $\ell$ fixed (here, $k = \ell = 3$), we repeatedly apply Steps 2 and 4 of ReGroup until no improvements are possible (Step 6). Iteration 3 (Step 2) is omitted, since it performs no swapping. To potentially decrease the cost further, we must increase $k$ or $\ell$ or both, as in Figure 1. [Panels: (a) Original groups (original matrix); (b) Row shifts (Step 2), iteration 1; (c) Column shifts (Step 4), iteration 2; (d) Column shifts (Step 4), iteration 4. Horizontal axis: column clusters.]

4  Algorithms 

In the previous section we established our goal: Among all possible $k$ and $\ell$ values, and all possible row- and column-groups, pick the arrangement with the smallest total compression cost, as MDL suggests (model plus data). Although theoretically pleasing, Eq. 2 does not tell us how to go about finding the best arrangement; it can only pinpoint the best one, among several candidates. The question is how to generate good candidates.

We  answer  this  question  in  two  steps: 

1. ReGroup (inner loop): For a given $k$ and $\ell$, find a good arrangement (i.e., cross-association).

2. CrossAssociationSearch (outer loop): Search for the best $k$ and $\ell$ ($k, \ell = 1, 2, \ldots$), re-using the arrangement so far.

We  present  each  in  the  following  sections. 

4.1  Alternating  Minimization  (ReGroup) 

Suppose we are given the number of row groups $k$ and the number of column groups $\ell$ and are interested in finding a cross-association $(\Psi^*, \Phi^*)$ that minimizes

$$\sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D_{i,j}), \tag{3}$$

where $D_{i,j}$ are the cross-associates of $D$, given $(\Psi^*, \Phi^*)$. We now outline a simple and efficient alternating minimization algorithm that yields a local minimum of Eq. 3. We should note that, in the regions where we typically perform the search, the code cost dominates the total cost by far (see also Figure 3 and Section 5.1), which justifies minimizing the code cost of Eq. 3 rather than the total cost of Eq. 2.

Algorithm ReGroup:

1. Let $t$ denote the iteration index. Initially, set $t = 0$. Start with an arbitrary cross-association $(\Psi^t, \Phi^t)$ of the matrix $D$ into $k$ row groups and $\ell$ column groups. For this initial partition, compute the cross-associate matrices $D^t_{i,j}$, and corresponding distributions $P_{D^t_{i,j}} = P^t_{i,j}$.

2. For this step, we will hold column assignments, i.e., $\Phi^t$, fixed. For every row $x$, splice it into $\ell$ parts, each corresponding to one of the column groups. Denote them as $x^1, \ldots, x^\ell$. For each of these parts, compute $n_u(x^j)$, $u = 0, 1$, and $j = 1, \ldots, \ell$. Now, assign row $x$ to that row group $\Psi^{t+1}(x)$ such that, for all $1 \le i \le k$:

$$\sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(x^j) \log \frac{1}{P^t_{\Psi^{t+1}(x),j}(u)} \le \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(x^j) \log \frac{1}{P^t_{i,j}(u)}. \tag{4}$$

3. With respect to cross-association $(\Psi^{t+1}, \Phi^t)$, recompute the matrices $D^{t+1}_{i,j}$, and corresponding distributions $P_{D^{t+1}_{i,j}} = P^{t+1}_{i,j}$.

4-5. Similar to steps 2-3, but swapping columns instead and producing a new cross-association $(\Psi^{t+1}, \Phi^{t+2})$ and corresponding cross-associates $D^{t+2}_{i,j}$ with distributions $P_{D^{t+2}_{i,j}} = P^{t+2}_{i,j}$.

6. If there is no decrease in total cost, stop; otherwise, set $t = t + 2$, go to step 2, and iterate.
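The steps above can be sketched with dense NumPy arrays as follows. This is an illustrative sketch rather than the authors' implementation. It assumes 0-based labels and batch updates, uses the smoothed densities $(n_u + 1/2)/(n + 1)$ to avoid infinite code lengths, breaks ties toward the lowest group index, reuses the row step on the transposed matrix for the column step, and stops on the code cost of Eq. 3. All function names are hypothetical.

```python
import numpy as np

def block_counts(D, row_labels, col_labels, k, l):
    """Return (n1, n): ones per cross-associate and block sizes a_i * b_j."""
    R = np.eye(k, dtype=int)[row_labels]  # m x k one-hot row membership
    C = np.eye(l, dtype=int)[col_labels]  # n x l one-hot column membership
    return R.T @ D @ C, np.outer(R.sum(axis=0), C.sum(axis=0))

def code_cost_total(D, row_labels, col_labels, k, l):
    """Objective of Eq. 3: sum of C(D_ij) over all cross-associates."""
    n1, n = block_counts(D, row_labels, col_labels, k, l)
    n0 = n - n1
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (np.where(n1 > 0, n1 * np.log2(n / n1), 0.0)
                 + np.where(n0 > 0, n0 * np.log2(n / n0), 0.0))
    return float(terms.sum())

def reassign_rows(D, row_labels, col_labels, k, l):
    """Step 2: move every row to the row group that encodes it most cheaply."""
    n1, n = block_counts(D, row_labels, col_labels, k, l)
    p1 = (n1 + 0.5) / (n + 1.0)           # smoothed density of ones per block
    C = np.eye(l, dtype=int)[col_labels]
    ones = D @ C                          # ones[x, j]  = n_1(x^j)
    zeros = C.sum(axis=0) - ones          # zeros[x, j] = n_0(x^j)
    # cost[x, i] = sum_j sum_u n_u(x^j) * log2(1 / P_ij(u))
    cost = -(ones @ np.log2(p1).T + zeros @ np.log2(1.0 - p1).T)
    return cost.argmin(axis=1)

def regroup(D, row_labels, col_labels, k, l, max_iter=50):
    """Alternate row and column reassignment until the code cost stops decreasing."""
    D = np.asarray(D, dtype=int)
    rows, cols = np.asarray(row_labels), np.asarray(col_labels)
    best = code_cost_total(D, rows, cols, k, l)
    for _ in range(max_iter):
        new_rows = reassign_rows(D, rows, cols, k, l)        # Steps 2-3
        new_cols = reassign_rows(D.T, cols, new_rows, l, k)  # Steps 4-5, by symmetry
        cost = code_cost_total(D, new_rows, new_cols, k, l)
        if cost >= best:                                     # Step 6
            break
        rows, cols, best = new_rows, new_cols, cost
    return rows, cols, best

# Two 3 x 3 all-ones blocks on the diagonal; row 3 starts in the wrong group.
D = np.array([[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3)
rows, cols, cost = regroup(D, [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1], 2, 2)
print(rows, cols, cost)
```

In this example a single row pass moves the misplaced row into the second group, after which every cross-associate is homogeneous and the code cost is zero. A sparse-matrix version would be needed to reach the running time discussed in the text; the dense products here are only for brevity.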


Figure 2 shows the alternating minimization algorithm in action. The graph consists of three square sub-matrices (“caves” [30]) with sizes 280, 180 and 90, plus 1% noise. We permute this matrix and try to recover its structure. As expected, for $k = \ell = 3$, the algorithm discovers the correct cross-associations. It is also clear that the algorithm finds progressively better representations of the matrix.

Theorem 4.1. For $t \ge 1$,

$$\sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D^t_{i,j}) \ge \sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D^{t+1}_{i,j}) \ge \sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D^{t+2}_{i,j}).$$

In words, ReGroup never increases the objective function (Eq. 3).

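Figure 3: General shape of the total cost (number of bits) versus number of cross-associates (synthetic cave graph with three square caves of sizes 32, 16 and 8, with 1% noise). The “waterfall” shape (with the description and code costs dominating the total cost in different regions) illustrates the intuition behind our model, as well as why our minimization strategy is effective. [Panels: (a) Total cost (surface); (b) Total cost (contour); (d) Code cost (surface); (e) Code cost (contour); (f) Code cost (diagonal). Panel titles: “Total cost vs. #cross-associates” and “Code cost vs. #cross-associates.”]

Proof. We shall only prove the first inequality; the second inequality will follow by symmetry between rows and columns.

$$
\begin{aligned}
\sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D^t_{i,j})
&= \sum_{i=1}^{k} \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(D^t_{i,j}) \log \frac{1}{P^t_{i,j}(u)} \\
&= \sum_{i=1}^{k} \sum_{j=1}^{\ell} \sum_{u=0}^{1} \Big[ \sum_{x : \Psi^t(x) = i} n_u(x^j) \Big] \log \frac{1}{P^t_{i,j}(u)} \\
&= \sum_{i=1}^{k} \sum_{x : \Psi^t(x) = i} \Big[ \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(x^j) \log \frac{1}{P^t_{i,j}(u)} \Big] \\
&\overset{(a)}{\ge} \sum_{i=1}^{k} \sum_{x : \Psi^t(x) = i} \Big[ \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(x^j) \log \frac{1}{P^t_{\Psi^{t+1}(x),j}(u)} \Big] \\
&\overset{(b)}{=} \sum_{i=1}^{k} \sum_{x : \Psi^{t+1}(x) = i} \Big[ \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(x^j) \log \frac{1}{P^t_{\Psi^{t+1}(x),j}(u)} \Big] \\
&= \sum_{i=1}^{k} \sum_{x : \Psi^{t+1}(x) = i} \Big[ \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(x^j) \log \frac{1}{P^t_{i,j}(u)} \Big] \\
&= \sum_{i=1}^{k} \sum_{j=1}^{\ell} \sum_{u=0}^{1} \Big[ \sum_{x : \Psi^{t+1}(x) = i} n_u(x^j) \Big] \log \frac{1}{P^t_{i,j}(u)} \\
&= \sum_{i=1}^{k} \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(D^{t+1}_{i,j}) \log \frac{1}{P^t_{i,j}(u)} \\
&\overset{(c)}{\ge} \sum_{i=1}^{k} \sum_{j=1}^{\ell} \sum_{u=0}^{1} n_u(D^{t+1}_{i,j}) \log \frac{1}{P^{t+1}_{i,j}(u)} \\
&= \sum_{i=1}^{k} \sum_{j=1}^{\ell} C(D^{t+1}_{i,j}),
\end{aligned}
$$

where (a) follows from Step 2 of ReGroup; (b) follows by re-writing the outer two sums, since $i$ is not used anywhere inside the $[\cdots]$ terms; and (c) follows from the non-negativity of the Kullback-Leibler distance. □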
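Step (c) of the proof can be spelled out for a single cross-associate. As a sketch, fix a block $(i, j)$ and write $n_u$ and $n$ as shorthand (introduced here, not in the original) for $n_u(D^{t+1}_{i,j})$ and $n(D^{t+1}_{i,j})$, so that $P^{t+1}_{i,j}(u) = n_u / n$. Then the gap between the two sides of (c) for that block is

$$
\sum_{u=0}^{1} n_u \log \frac{1}{P^t_{i,j}(u)} - \sum_{u=0}^{1} n_u \log \frac{1}{P^{t+1}_{i,j}(u)}
= n \sum_{u=0}^{1} P^{t+1}_{i,j}(u) \log \frac{P^{t+1}_{i,j}(u)}{P^t_{i,j}(u)}
= n \cdot \mathrm{KL}\big(P^{t+1}_{i,j} \,\|\, P^t_{i,j}\big) \ge 0.
$$

Summing over all blocks gives (c): encoding the regrouped blocks with their own densities is never more expensive than encoding them with the densities from the previous iteration.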


Remarks  Instead  of  batch  updates,  sequential  updates  are  also  possible.  Also,  rows  and  columns  need  not 
alternate  in  the  minimization.  We  have  many  locally  good  moves  available  (based  on  Theorem  4.1)  which  require 
only  linear  time. 
It is possible that ReGroup may cause some groups to be empty, i.e., $a_i = 0$ or $b_j = 0$ for some $1 \le i \le k$, $1 \le j \le \ell$ (to see that, consider, e.g., a homogeneous matrix; then we always end up with one group). In other words, we may find $k$ and $\ell$ less than those specified. Finally, we can easily avoid infinite quantities in Eq. 4 by using, e.g., $(n_u(A) + 1/2)/(n(A) + 1)$ for $P_A(u)$, $u = 0, 1$.

Initialization If we want to use ReGroup (inner loop) by itself, we have to initialize the mappings $(\Psi, \Phi)$. For $\Psi$, the simplest approach is to divide the rows evenly into $k$ initial “groups,” taking them in their original order. For $\Phi$ we do the initialization in the same manner. This often works well in practice. A better approach is to divide the “residual masses” (i.e., marginal sums of each column) evenly among $k$ groups, taking the rows in order of increasing mass (and similarly for $\Phi$). The initialization in Figure 2 is mass-based.
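Both initializations can be sketched as follows. The text does not pin down the mass-based variant precisely, so `mass_split_labels` is one plausible reading, stated here as an assumption: items are sorted by increasing mass and the cumulative mass is cut into equal shares, with the midpoint of each item's share deciding its group. The function names, the 0-based labels, and the caller-supplied `masses` vector of marginal sums are all hypothetical.

```python
import numpy as np

def even_split_labels(num_items, num_groups):
    """Split items 0..num_items-1, in their original order, into near-equal groups."""
    return (np.arange(num_items) * num_groups) // num_items

def mass_split_labels(masses, num_groups):
    """Split items into groups of roughly equal total mass, in order of increasing mass."""
    masses = np.asarray(masses, dtype=float)
    total = masses.sum()
    if total == 0:  # nothing to balance: fall back to the even split
        return even_split_labels(len(masses), num_groups)
    order = np.argsort(masses, kind="stable")
    cum = np.cumsum(masses[order])
    mid = cum - masses[order] / 2.0  # midpoint of each item's share of the mass
    groups = np.minimum((mid * num_groups / total).astype(int), num_groups - 1)
    labels = np.empty(len(masses), dtype=int)
    labels[order] = groups           # map back to the original item order
    return labels

print(even_split_labels(7, 3))
print(mass_split_labels([1, 1, 1, 1, 4], 2))
```

In the second call the four light items together carry the same mass as the single heavy one, so they form one group and the heavy item forms the other.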

However, our CrossAssociationSearch (outer loop) algorithm, described in the next section, is an even better alternative. We start with $k = \ell = 1$, increase $k$ and $\ell$ and create new groups, taking into account the cross-associations up to that point. This tightly integrated group creation scheme, which reuses current ReGroup row and column group assignments, yields much better results.


Complexity The algorithm is $O(n_1(D) \cdot (k + \ell) \cdot I)$ where $I$ is the number of iterations. In step (2) of the algorithm, we access each row and count its nonzero elements (of which there are $n_1(D)$ in total), then consider $k$ possible candidate row groups to place it into. Therefore, an iteration over rows is $O(n_1(D) \cdot k)$. Similarly, an iteration over columns (step 4) is $O(n_1(D) \cdot \ell)$. There is a total of $I/2$ row and $I/2$ column iterations. All this adds up to $O(n_1(D) \cdot (k + \ell) \cdot I)$.
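The dependence on $n_1(D)$ rather than on $m \times n$ comes from counting only the nonzero entries of each row. A minimal sketch, assuming each row is stored as a list of its nonzero column indices (a storage choice made here for illustration, with a hypothetical function name):

```python
def row_part_counts(nonzero_cols, col_labels, group_sizes):
    """Return (ones, zeros) per column group for one row stored sparsely.

    nonzero_cols: column indices of the ones in this row
    col_labels:   col_labels[y] is the (0-based) column group of column y
    group_sizes:  group_sizes[j] is b_j, the number of columns in group j
    """
    ones = [0] * len(group_sizes)
    for y in nonzero_cols:  # touches only the nonzero entries of the row
        ones[col_labels[y]] += 1
    # zeros follow from the group sizes, without scanning the zero entries
    zeros = [b - o for b, o in zip(group_sizes, ones)]
    return ones, zeros

print(row_part_counts([0, 3, 4], [0, 0, 1, 1, 1], [2, 3]))
```

The counts $n_u(x^j)$ needed in Eq. 4 are thus obtained in time proportional to the number of ones in the row, plus the number of column groups.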


4.2 Search for $k$ and $\ell$ (CrossAssociationSearch)


The last part of our approach is an algorithm to look for good values of $k$ and $\ell$. Based on our cost model (Eq. 2), we have a way to attack this problem. As we discuss later, the cost function usually has a “waterfall” shape (see Figure 3), with a sharp drop for small $k$ and $\ell$, and an ascent afterwards. Thus, it makes sense to start with small values of $k$, $\ell$, progressively increase them, and keep rearranging rows and columns based on fast, local moves in the search space (ReGroup). We experimented with several search strategies, and obtained good results with the following algorithm.


Algorithm CrossAssociationSearch:

1. Let T denote the search iteration index. Start with T = 0 and $k^0 = \ell^0 = 1$.

2. [Outer loop] At iteration T, try to increase the number of row groups. Set $k^{T+1} = k^T + 1$. Split the row
group r with maximum entropy per row, i.e.,

\[
r := \arg\max_{1 \le i \le k} \sum_{1 \le j \le \ell} \frac{n(D_{i,j})\, H(P_{D_{i,j}}(0))}{a_i}.
\]

Construct an initial label map $\Psi_0^{T+1}$ as follows: For every row x in row group r (i.e., for every $1 \le x \le m$
such that $\Psi^T(x) = r$), place it into the new group $k^{T+1}$ (i.e., set $\Psi_0^{T+1}(x) = k^{T+1}$) if and only if it decreases
the per-row entropy of group r, i.e., if and only if

\[
\sum_{1 \le j \le \ell} \frac{n(D'_{r,j})\, H(P_{D'_{r,j}}(0))}{a_r - 1}
< \sum_{1 \le j \le \ell} \frac{n(D_{r,j})\, H(P_{D_{r,j}}(0))}{a_r},
\tag{5}
\]

where $D'_{r,j}$ is $D_{r,j}$ without row x. Otherwise, we let $\Psi_0^{T+1}(x) = r = \Psi^T(x)$. If we move the row to the
new group, we update $D_{r,j}$ (for all $1 \le j \le \ell$) by removing row x (for subsequent estimations of Eq. 5).
Dataset           Dim. (a × b)        n_1(A)
CAVE              810 × 900           162,000
CAVE-Noisy        810 × 900           171,741
CUSTPROD          295 × 30            5,820
CUSTPROD-Noisy    295 × 30            5,602
NOISE             100 × 100           952
CLASSIC           3,893 × 4,303       176,347
GRANTS            13,297 × 5,298      805,063
EPINIONS          75,888 × 75,888     508,960
CLICKSTREAM       23,396 × 199,308    952,580
OREGON            11,461 × 11,461     65,460

Table 2: Dataset characteristics.

3. [Inner loop] Use ReGroup with initial cross-associations $(\Psi_0^{T+1}, \Phi^T)$ to find new ones $(\Psi^{T+1}, \Phi^{T+1})$ and
the corresponding total cost.

4. If there is no decrease in total cost, stop and return $(k^*, \ell^*) = (k^T, \ell^T)$, with corresponding
cross-associations $(\Psi^T, \Phi^T)$. Otherwise, set T = T + 1 and continue.

5-7. Similar to steps 2-4, but trying to increase column groups instead.
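The row-splitting step (step 2) can be sketched in Python as below. This is an illustrative sketch under several assumptions, not the authors' MATLAB implementation: the matrix is a dense NumPy array, labels are 0-based (so the new group receives label k), $n(D_{i,j})$ is taken to be the number of cells in the block, and the function names and the guard against emptying group r are our own additions.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) source, with H(0) = H(1) = 0."""
    p = np.asarray(p, dtype=float)
    h = np.zeros_like(p)
    inside = (p > 0) & (p < 1)
    q = p[inside]
    h[inside] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return h

def entropy_per_row(ones, a, b):
    """sum_j n(D_rj) H(P_Drj(0)) / a for a row group of `a` rows.

    ones[j] is the number of ones in block (r, j); b[j] is the number of
    columns in column group j, so n(D_rj) = a * b[j]. Since H(p) = H(1 - p),
    the fraction of ones can stand in for the fraction of zeros.
    """
    cells = a * b
    frac = np.divide(ones, cells, out=np.zeros(len(b)), where=cells > 0)
    return float(np.sum(cells * binary_entropy(frac)) / a)

def split_row_group(D, psi, phi, k, l):
    """One row split: returns (new row labels, index r of the split group)."""
    b = np.bincount(phi, minlength=l)            # columns per column group
    # ones_by_row[x, j] = number of ones of row x inside column group j
    ones_by_row = np.stack([D[:, phi == j].sum(axis=1) for j in range(l)], axis=1)
    sizes = np.bincount(psi, minlength=k)        # rows per row group
    scores = [entropy_per_row(ones_by_row[psi == i].sum(axis=0), sizes[i], b)
              if sizes[i] > 0 else -1.0 for i in range(k)]
    r = int(np.argmax(scores))                   # group with maximum entropy per row

    new_psi = psi.copy()
    ones_r = ones_by_row[psi == r].sum(axis=0)
    a_r = int(sizes[r])
    for x in np.flatnonzero(psi == r):
        if a_r <= 1:                             # assumed guard: never empty group r
            break
        without_x = entropy_per_row(ones_r - ones_by_row[x], a_r - 1, b)
        if without_x < entropy_per_row(ones_r, a_r, b):   # the test of Eq. 5
            new_psi[x] = k
            ones_r = ones_r - ones_by_row[x]     # D_rj without row x, for later tests
            a_r -= 1
    return new_psi, r

# Toy example: two kinds of rows that start out in a single row group.
D = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3)
psi0, r = split_row_group(D, np.zeros(6, dtype=int), np.array([0, 0, 1, 1]), k=1, l=2)
```

On this toy matrix the first three rows move to the new group, after which the remaining group is homogeneous and no further row satisfies the inequality. The resulting labels would then be handed to ReGroup as its starting point.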


Figure 1 shows the search algorithm in action. Starting from the initial matrix (CAVES), we successively increase
the number of column and row groups. For each such increase, the columns are shifted using ReGroup. The
algorithm successfully stops after iteration pair 4 (Figure 1(e)).

Lemma 4.1. If $D = [D_1\ D_2]$, then $C(D_1) + C(D_2) \le C(D)$.

Proof. We have

\[
\begin{aligned}
C(D) &= n(D)\, H(P_D(0)) \\
     &= n(D)\, H\!\left( \frac{P_{D_1}(0)\, n(D_1) + P_{D_2}(0)\, n(D_2)}{n(D)} \right) \\
     &\ge n(D_1)\, H(P_{D_1}(0)) + n(D_2)\, H(P_{D_2}(0)) \\
     &= C(D_1) + C(D_2),
\end{aligned}
\]

where the inequality follows from the concavity of $H(\cdot)$ and the fact that $n(D_1) + n(D_2) = n(D)$ or
$n(D_1)/n(D) + n(D_2)/n(D) = 1$. □
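Lemma 4.1 is easy to check numerically. The sketch below assumes $C(D) = n(D)\,H(P_D(0))$ with $n(D)$ the number of cells, as in the proof; the helper name `code_cost` and the random test matrices are hypothetical.

```python
import numpy as np

def code_cost(D: np.ndarray) -> float:
    """C(D) = n(D) * H(P_D(0)) in bits, where n(D) is the number of cells."""
    n = D.size
    p0 = 1.0 - D.sum() / n
    if p0 in (0.0, 1.0):          # homogeneous block: zero code cost
        return 0.0
    return float(n * (-p0 * np.log2(p0) - (1 - p0) * np.log2(1 - p0)))

rng = np.random.default_rng(0)
D1 = (rng.random((20, 30)) < 0.1).astype(int)   # sparse block
D2 = (rng.random((20, 50)) < 0.6).astype(int)   # dense block
D = np.hstack([D1, D2])                         # D = [D1 D2]

split_cost = code_cost(D1) + code_cost(D2)
joint_cost = code_cost(D)
```

Here `split_cost` never exceeds `joint_cost`, and the gap grows as the densities of the two blocks move apart, which is what makes splitting heterogeneous groups worthwhile.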

Note that the original code cost is zero only for a completely homogeneous matrix. Also, the code length for
(k, ℓ) = (a, b) is, by definition, zero. Therefore, provided that the fraction of non-zeros is not the same for every
column (and since $H(\cdot)$ is strictly concave), the next observation follows immediately.

Corollary 4.1. For any $k_1 \ge k_2$ and $\ell_1 \ge \ell_2$ there exist cross-associations such that $(k_1, \ell_1)$ leads to a shorter
code (Eq. 3).
Figure 4: Cross-associations on synthetic datasets: Our method gives the intuitively correct cross-associations for
(a) CAVE and (b) CUSTPROD. In the noisy versions (c, d), a few extra groups are found due to patterns that emerge,
such as the “almost-empty” and “more-dense” cross-associations for pure NOISE (e).


By Corollary 4.1, the outer loop in CrossAssociationSearch decreases the objective cost function. By
Theorem 4.1 the same holds for the inner loop (ReGroup). Therefore, the entire algorithm
CrossAssociationSearch also decreases the objective cost function (Eq. 3). However, the description complexity evidently
increases with (k, ℓ). We have found that, in practice, this search strategy performs very well. Figure 3 (discussed
in Section 5.1) provides an indication why this is so.

Complexity Since at each step of the search we increase either k or ℓ, the sum k + ℓ always increases by one.
Therefore, the overall complexity of the search is $O(n_1(D)(k^* + \ell^*)^2)$, if we ignore the number of ReGroup
iterations I (in practice, $I \le 20$ is always sufficient).
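The quadratic bound can be spelled out as a short derivation. It assumes that the cost of each search step is dominated by the ReGroup call, which costs $O(n_1(D) \cdot (k + \ell) \cdot I)$ at the current group counts, and it writes $s = k + \ell$ for the sum at a given step (a symbol introduced here for convenience):

\[
\sum_{s=2}^{k^* + \ell^*} O\bigl(n_1(D) \cdot s \cdot I\bigr)
= O\Bigl(n_1(D) \cdot I \cdot \sum_{s=2}^{k^* + \ell^*} s\Bigr)
= O\bigl(n_1(D) \cdot I \cdot (k^* + \ell^*)^2\bigr).
\]

Treating I as a constant gives the stated $O(n_1(D)(k^* + \ell^*)^2)$.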

5  Experiments 

We  did  experiments  to  answer  two  key  questions:  (i)  how  good  is  the  quality  of  the  results  (which  involves  both  the 
proposed  criterion  and  the  minimization  strategy),  and  (ii)  how  well  does  the  method  scale  up.  To  the  best  of  our 
knowledge,  in  the  literature  to  date,  no  other  method  has  been  explicitly  proposed  and  studied  for  parameter-free, 
joint  clustering  of  binary  matrices. 

We used several datasets (see Table 2), both real and synthetic. The synthetic ones were: (1) CAVE, representing
a social network of “cavemen” [30], that is, a block-diagonal matrix of variable-size blocks (or “caves”), (2)
CUSTPROD, representing groups of customers and their buying preferences², (3) NOISE, with pure white noise.
We also created noisy versions of CAVE and CUSTPROD (CAVE-Noisy and CUSTPROD-Noisy), by adding
noise (10% of the number of non-zeros).
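A caveman-style matrix with noise can be generated along the following lines. This is a sketch under stated assumptions: the text does not say exactly how noise is injected, so we assume it means flipping a number of uniformly chosen cells equal to the given fraction of the non-zeros. The function name, the square shape and the cave sizes are illustrative choices.

```python
import numpy as np

def caveman_matrix(cave_sizes, noise_frac=0.0, seed=0):
    """Block-diagonal binary matrix with one all-ones block per cave.

    Noise (assumed model): flip round(noise_frac * #nonzeros) distinct cells
    chosen uniformly at random, so ones may vanish and zeros may appear.
    """
    n = sum(cave_sizes)
    D = np.zeros((n, n), dtype=np.uint8)
    start = 0
    for s in cave_sizes:
        D[start:start + s, start:start + s] = 1
        start += s
    n_flip = int(round(noise_frac * int(D.sum())))
    if n_flip:
        rng = np.random.default_rng(seed)
        idx = rng.choice(D.size, size=n_flip, replace=False)
        flat = D.reshape(-1)      # view onto D (D is C-contiguous)
        flat[idx] ^= 1
    return D

clean = caveman_matrix([32, 16, 8])
noisy = caveman_matrix([32, 16, 8], noise_frac=0.10)
```

Under this flipping model the number of non-zeros can go up or down, depending on how many flipped cells fall inside the caves.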

The  real  datasets  are:  (1)  CLASSIC,  Usenet  documents  (Cornell’s  SMART  collection  [10]),  (2)  GRANTS, 
13,297  documents  (NSF  grant  proposal  abstracts)  from  several  disciplines  (physics,  bio-informatics,  etc.),  (3) 
EPINIONS,  a  who-trusts-whom  social  graph  of  www.epinions.com  users  [31],  (4)  CLICKSTREAM,  with 
users  and  URLs  they  clicked  on  [32],  and  (5)  OREGON,  with  connections  between  Autonomous  Systems  (AS)  in 
the  Internet. 

Our  implementation  was  done  in  MATLAB  (version  6.5  on  Linux)  using  sparse  matrices.  The  experiments 
were  performed  on  an  Intel  Xeon  2.8GHz  machine  with  1GB  RAM. 

² We try to capture market segments with heavily overlapping product preferences, like, say, “single persons,” buying beer and chips,
“couples,” buying the above plus frozen dinners, “families,” buying all the above plus milk, etc.




[Figure 5 panels: (a) CLASSIC cross-associates (k* = 15, ℓ* = 19), with row groups labeled MEDLINE, CISI and CRANFIELD and column clusters on the horizontal axis; (b) GRANTS cross-associates (k* = 41, ℓ* = 28). Word annotations legible in the panels include: “insipidus, alveolar, aortic, death, prognosis, intravenous”; “blood, disease, clinical, cell, tissue, patient”; “shape, nasa, leading, assumed, thin”; “providing, studying, records, developments, students, rules, community”; “abstract, notation, works, construct, bibliographies”; “paint, examination, fall, raise, leave, based”; “manifolds, operators, harmonic, operator, topological”; “undergraduate, education, national, projects”; “encoding, characters, bind, nucleus, recombination”; “coupling, deposition, plasma, separation, beam”; “meetings, organizations, session, participating”.]


Figure 5: Cross-associations for CLASSIC and GRANTS: Due to the dataset sizes, we show the cross-associations
via shading; darker shades correspond to denser blocks (more ones). We also show the most frequently occurring
words for several of the word (column) groups.

5.1  Quality 

Total  code  length  criterion  Figure  3  illustrates  the  intuition  behind  both  our  information-theoretic  cost  model, 
as  well  as  our  minimization  strategy.  It  shows  the  general  shape  of  the  total  cost  (in  number  of  bits)  versus  the 
number  of  cross-associates.  For  this  graph,  we  used  a  “caveman”  matrix  with  three  caves  of  sizes  32,  16  and  8, 
adding  noise  (1%  of  non-zeros).  We  used  ReGroup,  forcing  it  to  never  empty  a  group.  The  slight  local  jaggedness 
in  the  plots  is  due  to  the  presence  of  noise  and  occasional  local  minima  hit  by  ReGroup. 

However, the figure reveals nicely the overall, global shape of the cost function. It has a “waterfall” shape,
dropping very fast initially, then rising again as the number of cross-associates increases. For small k, ℓ, the code
cost dominates the description cost (in bits), while for large k, ℓ the description cost is the dominant one. The key
points, regarding the model as well as the search strategies, are:

• The optimal (k*, ℓ*) is the “sweet spot” balancing these two. The trade-off between description complexity
and code length indeed has a desirable form, as expected.

• As expected, cost iso-surfaces roughly correspond to k · ℓ = const., i.e., to a constant number of
cross-associates.

• Moreover, for relatively small (k, ℓ), the code cost clearly dominates the total cost by far, which justifies our
choice of objective function (Eq. 3).

• The overall, well-behaved shape also demonstrates that the cost model is amenable to efficient search for a
minimum, based on the proposed linear-time, local moves.

• It also justifies why starting the search with k = ℓ = 1 and gradually increasing them is an effective
approach: we generally find the minimum after a few CrossAssociationSearch (outer loop) iterations.
Clusters    Document class                       Precision
found       CRANFIELD    CISI     MEDLINE
1           0            1        390            0.997
2           2            676      9              0.984
3           0            0        610            1.000
4           1            317      6              0.978
5           188          0        0              1.000
6           207          0        0              1.000
7           3            452      16             0.960
8           131          0        0              1.000
9           209          0        0              1.000
10          107          2        0              0.982
11          152          3        2              0.968
12          74           0        0              1.000
13          139          9        0              0.939
14          163          0        0              1.000
15          24           0        0              1.000
Recall      0.996        0.990    0.968

Table  3:  The  clusters  for  CLASSIC  (see  Figure  5(a))  recover  the  known  document  classes.  Furthermore,  our 
approach  also  captures  unknown  structure  (such  as  the  “technical”  and  “everyday”  medical  terms). 
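The precision and recall figures in Table 3 can be reproduced from the counts with a few lines of Python. The definitions used here are our reading of the table rather than something stated in the text: each cluster is assigned its majority class, precision is the majority fraction within a cluster, and recall is the fraction of a class's documents that land in clusters assigned to that class.

```python
import numpy as np

# Rows: clusters 1..15; columns: CRANFIELD, CISI, MEDLINE (counts from Table 3).
counts = np.array([
    [0, 1, 390], [2, 676, 9], [0, 0, 610], [1, 317, 6], [188, 0, 0],
    [207, 0, 0], [3, 452, 16], [131, 0, 0], [209, 0, 0], [107, 2, 0],
    [152, 3, 2], [74, 0, 0], [139, 9, 0], [163, 0, 0], [24, 0, 0],
])

majority = counts.argmax(axis=1)                       # class assigned to each cluster
precision = counts.max(axis=1) / counts.sum(axis=1)    # per cluster
recall = np.array([
    counts[majority == c, c].sum() / counts[:, c].sum()  # per document class
    for c in range(counts.shape[1])
])
```

Rounded to three decimals, these values agree with the Precision column and the Recall row of the table.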


Results — synthetic  data  Figure  4  depicts  the  cross-associations  found  by  our  method  on  several  synthetic 
datasets.  For  the  noise-free  synthetic  matrices  CAVE  and  CUSTPROD,  we  get  exactly  the  intuitively  correct  groups. 
This  serves  as  a  sanity  check  for  our  whole  approach  (criterion  plus  heuristics).  When  noise  is  present,  we  find 
some  extra  groups  which,  on  closer  examination,  are  picking  up  patterns  in  the  noise.  This  is  expected:  it  is  well 
known  that  spurious  patterns  emerge,  even  when  we  have  pure  noise.  Figure  4(e)  confirms  it:  even  in  the  NOISE 
matrix,  our  algorithm  finds  blocks  of  clearly  lower  or  higher  density. 

Results — real  data  Figures  5  and  6  show  the  cross-associations  found  on  several  real-world  datasets.  They 
demonstrate  that  our  method  gives  intuitive  results. 

Figure  5(a)  shows  the  CLASSIC  dataset,  where  the  rows  correspond  to  documents  from  MEDLINE  (medical 
journals),  CISI  (information  retrieval)  and  CRANFIELD  (aerodynamics);  and  the  columns  correspond  to  words. 

First, we observe that the cross-associates are in agreement with the known document classes (left axis
annotations). We also annotated some of the column groups with their most frequent words. Cross-associates belonging
to the same document (row) group clearly follow similar patterns with respect to the word (column) groups. For
example, the MEDLINE row groups are most strongly related to the first and second column groups, both of which
are related to medicine (“insipidus,” “alveolar,” “prognosis” in the first column group; “blood,” “disease,” “cell,”
etc., in the second).

Besides being in agreement with the known document classes, the cross-associates reveal further structure
(see Table 3). For example, the first word group consists of more “technical” medical terms, while the second group
consists of “everyday” terms, or terms that are used in medicine often, but not exclusively³. Thus, the second word
group is more likely to show up in other document groups (and indeed it does, although not immediately apparent
in the figure), which is why our algorithm separates the two.

³ This observation is also true for nearly all of the (approximately) 600 and 100 words belonging to each group, not only the most
frequent ones shown here.
Figure 6: Cross-associations for EPINIONS, OREGON and CLICKSTREAM: The matrices are organized
successfully in homogeneous regions. (d) shows that our method captures dense clusters, irrespective of their size.

Figure 5(b) shows GRANTS, which consists of NSF grant proposal abstracts in several disciplines, such as
genetics, mathematics, physics, and organizational studies. Again, the terms are meaningfully grouped: e.g., those
related to biology (“encoding,” “recombination,” etc.), to physics (“coupling,” “plasma,” etc.) and to material
sciences.

We also present experiments on matrices from other settings: social networks (EPINIONS), computer
networks (OREGON) and web visit patterns (CLICKSTREAM). In all cases, our algorithm organizes the matrices in
homogeneous regions. Also, in EPINIONS, notice that there is a small but dense cluster, probably
corresponding to a dense clique of experts who mainly trust each other. The large gray rectangle should correspond to
another, much larger, but less coherent, group of people.

Compression and density Figure 7 lists the compression ratios achieved by our cross-association algorithms
for each dataset. Figure 8 shows how our algorithm effectively divides the CLASSIC matrix into sparse and dense
regions (i.e., cross-associates), thereby summarizing its structure.

                  Average cost per element
Dataset           k = ℓ = 1    Optimal k*, ℓ*    Ratio
CAVE              0.766        0.00065           (1:1178)
CAVE-Noisy        0.788        0.1537            (1:5.1)
CUSTPROD          0.930        0.0320            (1:29)
CUSTPROD-Noisy    0.952        0.3814            (1:2.5)
NOISE             0.2846       0.2748            (1:1.03)
CLASSIC           0.0843       0.0688            (1:1.23)
GRANTS            0.0901       0.0751            (1:1.20)
EPINIONS          0.0013       0.00081           (1:1.60)
OREGON            0.0062       0.0037            (1:1.68)
CLICKSTREAM       0.0028       0.0019            (1:1.47)

(a) All datasets

(b) Real datasets [plot: compression ratio at the optimal k*, ℓ*]

Figure 7: Compression ratios.
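The ratios reported in Figure 7 appear to be the quotient of the two average costs per element, which the following sketch recomputes. Reading the first cost column as the k = ℓ = 1 baseline is an assumption on our part; the numbers are copied from the figure.

```python
# Average cost per element: (assumed k = l = 1 baseline, cost at optimal k*, l*).
cost = {
    "CAVE": (0.766, 0.00065),
    "CAVE-Noisy": (0.788, 0.1537),
    "CUSTPROD": (0.930, 0.0320),
    "CUSTPROD-Noisy": (0.952, 0.3814),
    "NOISE": (0.2846, 0.2748),
    "CLASSIC": (0.0843, 0.0688),
    "GRANTS": (0.0901, 0.0751),
    "EPINIONS": (0.0013, 0.00081),
    "OREGON": (0.0062, 0.0037),
    "CLICKSTREAM": (0.0028, 0.0019),
}

# Compression ratio 1:x, where x = baseline cost / optimal cost.
ratios = {name: base / best for name, (base, best) in cost.items()}
```

Pure NOISE compresses by only a few percent, while the noise-free CAVE matrix compresses by three orders of magnitude, which matches the intuition that structure is what the method exploits.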

5.2  Scalability 

Figure 9 shows wall-clock times (in seconds) of our MATLAB implementation. In all plots, the datasets were
cave-graphs with three caves. For the noiseless case (b), times for both ReGroup and CrossAssociationSearch
increase linearly with respect to the number of non-zeros. We observe similar behavior for the noisy case (c). The
“sawtooth”  patterns  are  explained  by  the  fact  that  we  used  a  new  matrix  for  each  case.  Thus,  it  was  possible  for 
some  graphs  to  have  different  “regularity”  (spuriously  emerging  patterns),  and  thus  compress  better  and  faster. 
Indeed,  when  we  approximately  scale  by  the  number  of  inner  loop  iterations  in  CrossAssociationSearch,  an 
overall  linear  trend  (with  variance  due  to  memory  access  overheads  in  MATLAB)  appears. 

Finally, Figure 10 shows the progression of total cost (in bits) for every iteration of
CrossAssociationSearch (outer loop). We clearly see that our algorithm quickly finds better cross-associations. These plots are
from the same wall-clock time experiments.

6  Conclusions 

We have proposed one of the few methods for clustering and graph partitioning that need no “magic numbers.”

•  Besides  being  fully  automatic,  our  approach  satisfies  all  properties  (P1)-(P3):  it  finds  row  and  column  groups 
simultaneously  and  scales  linearly  with  problem  size. 

• We introduce a novel approach and propose a general, intuitive model founded on compression and
information-theoretic principles.

•  We  provide  an  integrated,  two-level  framework  to  find  cross-associations,  consisting  of  ReGroup  (inner 
loop)  and  CrossAssociationSearch  (outer  loop). 

•  We  give  an  effective  search  strategy  to  minimize  the  total  code  length,  taking  advantage  of  the  cost  function 
properties  (“waterfall”  shape). 

Also,  our  method  is  easily  extensible  to  matrices  with  categorical  values.  We  evaluate  our  method  on  several  real 
and  synthetic  datasets,  where  it  produces  intuitive  results. 
Figure 8: The algorithm splits the original CLASSIC matrix into homogeneous, high- and low-density
cross-associations. [Axis label: Number of cells.]